White Wine Quality Analysis by Navpreet Singh
This report explores a data set of white wines of varying quality and chemical composition. The data set contains 4898 observations of 12 variables. The report covers univariate, bivariate and multivariate analysis of the variables in the data set.
## [1] 4898 12
The data set describes the quality and chemical composition of over 4800 wines.
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ quality.bucket : Factor w/ 3 levels "(0,5]","(5,7]",..: 2 2 2 2 2 2 2 2 2 2 ...
- Since quality is a discrete variable with values between 3 and 9, I converted it to a factor (categorical) variable.
- I created a new categorical variable, quality.bucket, using the cut function to group the quality scores as (0,5] - Poor, (5,7] - Average and (7,10] - Good.
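The two transformations above can be sketched as follows. This is a minimal illustration on a small, made-up `wines` data frame (the values are hypothetical, not rows from the real data set); the report's own code may differ.

```r
# Hypothetical stand-in for the wine data: only the quality score is needed here.
wines <- data.frame(quality = c(3, 5, 6, 6, 7, 8, 9))

# Bin the numeric score first; cut() uses right-closed intervals by default,
# which gives the labels "(0,5]", "(5,7]" and "(7,10]".
wines$quality.bucket <- cut(wines$quality, breaks = c(0, 5, 7, 10))

# Then store the score itself as a categorical variable.
wines$quality <- factor(wines$quality)

table(wines$quality.bucket)
```

Binning before the factor conversion matters, because cut() expects a numeric vector.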
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
- Most of the wines have a score of 5 or 6. Only a small number of wines are of very good quality.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.bucket
## 3: 20 (0,5] :1640
## 4: 163 (5,7] :3078
## 5:1457 (7,10]: 180
## 6:2198
## 7: 880
## 8: 175
## 9: 5
Univariate Plots Section
Quality Distributions

This plot shows the distribution of wine quality. The lowest score is 3 and the highest is 9. Most of the wines score between 5 and 7.
Distribution of quality as Good, Poor and Average

This plot divides the wines into the Poor, Average and Good categories. Most of the wines fall in the Average category.
Distribution of all quantitative variables

The plot above gives a quick glance at the distribution of all quantitative variables in the data. Below is an overview of some of the important variables:
- Alcohol is widely dispersed across its range.
- Sulphates shows signs of a bimodal distribution.
- pH and total sulfur dioxide look evenly spread and roughly normally distributed.
- Chlorides shows very little dispersion.
Alcohol distribution

As the boxplot shows, the variance in alcohol content is large, with the middle 50% of wines having an alcohol content between 9.5% and 11.4%.
pH Distribution

The pH distribution is almost normal with small variance, but it contains a number of outliers. Most of the wines are in the pH range of 3-3.3.
Residual Sugar Distribution

The distribution of residual.sugar is right-skewed. After a log10 transformation, the distribution appears bimodal.
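The effect of a log10 transformation on a right-skewed variable can be sketched as follows. The `sugar` vector is simulated, hypothetical data (two groups loosely imitating drier and sweeter wines), and the `skewness` helper is defined here for illustration; neither comes from the report's own code.

```r
set.seed(42)

# Simulated, hypothetical residual sugar values in g/dm^3.
sugar <- c(rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(500, meanlog = log(10),  sdlog = 0.4))

# Sample skewness: positive values indicate a longer right tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(sugar)         # raw scale
skewness(log10(sugar))  # log10 scale

# On the log10 scale the two groups show up as separate peaks.
hist(log10(sugar), breaks = 40,
     main = "log10(residual sugar)", xlab = "log10(g/dm^3)")
```

On the raw scale the long right tail compresses the low values into a few bins; the log scale spreads them out, which is why a second mode can become visible.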
Volatile Acidity Distribution

- The volatile acidity distribution is slightly right skewed.
- The second plot above shows volatile acidity for the different wine qualities. The distribution shows that high quality wines have lower volatile acidity.
Density Distribution

Most of the values (98%) fall in the range of 0.99-1.00 g/cm3. The density variable also has some outliers, which stretch the tail of the distribution to 1.04 g/cm3.
Total Sulfur Dioxide Distribution

- The first plot shows the distribution of total sulfur dioxide. The distribution has a small tail. Most of the values lie between 70 mg/dm3 and 210 mg/dm3.
- The second plot shows the distribution of total sulfur dioxide after a square root transformation. The transformation removes the tail and the values look approximately normally distributed.
Univariate Analysis
What is the structure of your dataset?
The dataset contains 4898 observations of 12 variables. I created one more variable, quality.bucket, which categorizes each wine as “Good”, “Average” or “Poor” based on its quality score.
What is/are the main feature(s) of interest in your dataset?
The main features of interest are density, alcohol and quality.
What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?
I think volatile acidity, residual sugar, pH and the sulfur dioxide features will help in the investigation.
Did you create any new variables from existing variables in the dataset?
I created one extra feature, quality.bucket, which categorizes each wine as “Good”, “Average” or “Poor” based on its quality score.
Bivariate Plots Section
Relationship between different variables

The plot above shows the relationship of each variable in the dataset with every other variable. We will explore some of them in more detail.
Alcohol and Quality:

As the plot above shows, higher alcohol content is associated with higher quality. Alcohol appears to be one of the major driving forces behind wine quality.
Free Sulfur Dioxide VS Total Sulfur Dioxide

The plot shows a strong relationship between free sulfur dioxide and total sulfur dioxide. This is expected, because free sulfur dioxide is a component of total sulfur dioxide.
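The strength of such a relationship can be quantified with a correlation coefficient. The sketch below uses only the first ten observations printed in the structure listing near the top of this report, so it illustrates the call and is not the correlation for the full data set.

```r
# First ten observations, copied from the structure listing above.
free.sulfur.dioxide  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total.sulfur.dioxide <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Pearson correlation: values close to 1 indicate a strong positive
# linear relationship, values close to 0 a weak one.
r <- cor(free.sulfur.dioxide, total.sulfur.dioxide)
r
```

With the full data frame loaded, the same call on the two columns would give the figure that corresponds to the plot.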
Acidity and pH:

Fixed acidity has a strong negative relationship with pH in all 3 quality buckets. This is expected, since pH measures acidity on an inverted scale: the more acidic the wine, the lower its pH.
Density VS Alcohol

The plot shows a negative relationship between density and alcohol. This is expected, since alcohol (ethanol) is less dense than water, so a wine with a higher alcohol content has a lower density.
Density VS Residual Sugar

The plot above shows a positive relationship, with most values in the range of 0.99-1 g/cm3 for density and 0-15 g/dm3 for residual sugar.
Citric Acid VS Fixed Acidity

Citric acid and fixed acidity do not show a very strong relationship. Most of the values are clustered in 0.2-0.4 g/dm3 for citric acid and 5-9 g/dm3 for fixed acidity.
Mean of Alcohol by Quality

Surprisingly, the mean alcohol content dips at a quality level of 5 and then rises again with increasing quality.
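A per-quality mean like the one plotted above can be computed with a grouped aggregation. This is a minimal sketch on a small, made-up `wines` data frame; the alcohol values are hypothetical and only show the mechanics, not the dip described above.

```r
# Hypothetical stand-in for the wine data.
wines <- data.frame(
  quality = factor(c(5, 5, 6, 6, 7, 7)),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.3, 11.9)
)

# One row per quality level, holding the mean alcohol content of that level.
mean_alcohol <- aggregate(alcohol ~ quality, data = wines, FUN = mean)
mean_alcohol
```

The resulting data frame has one row per quality level and can be passed straight to a plotting function.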
Bivariate Analysis
Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?
- The quality of a wine depends largely on its alcohol content. There is a positive relationship between alcohol and quality, so alcohol may be useful for predicting the quality of a wine.
- Density shows a negative relationship with alcohol. Alcohol is less dense than water, so wines with a higher alcohol content have a lower density, and lower density in turn goes with higher quality.
- Acidity has a strong negative relationship with pH. This is expected, since a more acidic wine has a lower pH.
Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?
- Density and residual sugar show a strong positive relationship, which surprised me. I had assumed that sugar levels would fall as density increased. In hindsight, dissolved sugar makes the wine denser, which accounts for the positive relationship.
- I was expecting a strong positive relationship between citric acid and fixed acidity, but the relationship was quite weak.
What was the strongest relationship you found?
Multivariate Plots Section
Alcohol VS Density VS Quality Buckets

The plot above displays the middle 95% of the density and alcohol data, separated by quality bucket. As the plot shows, wine quality increases as alcohol content increases and density decreases.
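Restricting a plot to the middle 95% of two variables can be sketched with quantile-based filtering. The data frame below is simulated and the helper name `in_middle` is hypothetical; the report itself may instead have limited the plot axes to the same quantiles.

```r
set.seed(1)

# Simulated, hypothetical stand-in for the alcohol and density columns.
wines <- data.frame(
  alcohol = rnorm(200, mean = 10.5, sd = 1.2),
  density = rnorm(200, mean = 0.994, sd = 0.003)
)

# TRUE for values inside the central share p of x (the middle 95% by default).
in_middle <- function(x, p = 0.95) {
  lo <- quantile(x, (1 - p) / 2)
  hi <- quantile(x, 1 - (1 - p) / 2)
  x >= lo & x <= hi
}

# Keep only rows that are inside the central range for both variables.
trimmed <- wines[in_middle(wines$alcohol) & in_middle(wines$density), ]
nrow(trimmed)
```

Filtering rows and limiting axes are not equivalent when a smoother or summary layer is added, since filtered rows are dropped from those calculations as well.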
Fixed Acidity VS pH VS Quality Buckets

The plot above relates fixed acidity, pH and wine quality. pH and acidity have a strong negative relationship with each other. Good quality wines have slightly higher pH for the same acidity level compared to low quality wines, which means good wines are slightly less acidic.
Free Sulfur Dioxide VS Total Sulfur Dioxide VS Quality

The diagram above shows the middle 99% of the free and total sulfur dioxide data. For high quality wines, total sulfur dioxide increases much more slowly than it does for low quality wines.
Multivariate Analysis
Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
- The first relationship I observed is among alcohol, density and wine quality. Quality increases as alcohol content increases and density decreases.
- The second relationship I observed is among fixed acidity, pH and wine quality. pH and acidity have a strong negative relationship with each other, but the point to note is that good quality wines have slightly higher pH for the same acidity level compared to low quality wines. It means good wines are slightly less acidic.
- The third relationship I observed is among free sulfur dioxide, total sulfur dioxide and wine quality. For high quality wines (especially those with a score of 9), total sulfur dioxide increases with free sulfur dioxide at a slower rate than for low quality wines.
Were there any interesting or surprising interactions between features?
The first relationship, among alcohol, density and wine quality, is the interesting one. Wines with a high alcohol content tend to be rated as good quality. One possible explanation is that people prefer wines with a more concentrated alcohol content because of the stronger effect.
Final Plots and Summary
Plot One - Density distribution by Quality

Description One
This is a boxplot of density for the different wine qualities. The plot shows the negative relationship between density and wine quality: as density decreases, quality increases.
Plot Two - Alcohol by Quality Buckets

Description Two
This is the plot of alcohol content per quality bucket. It shows that higher alcohol content is one of the major driving forces behind wine quality.
Plot Three - Acidity VS pH

Description Three
This plot relates fixed acidity, pH and wine quality. pH and acidity have a strong negative relationship with each other. Good quality wines have slightly higher pH for the same acidity level compared to low quality wines, which means good wines are slightly less acidic.
Reflection
In this report, exploratory data analysis was performed on a data set of white wines of different quality and chemical composition. The data set contains 4898 observations of 12 variables.
The analysis started with univariate analysis, where a single variable was analyzed at a time. It was followed by bivariate analysis, where the relationship between two variables was analyzed. It ended with multivariate analysis, where the relationships among more than 2 variables were analyzed.
Following are the key points of the analysis performed:
- The amount of alcohol in the wine is one of the major factors in wine quality. Wines with more alcohol received better quality scores than wines with less alcohol.
- The density of the wine also plays a key role. It has a negative relationship with quality: as density increases, quality decreases.
- For the various acids present in wine, I observed that as the quantity of acid increases, the pH decreases and so does the quality of the wine. The effect is not large, but it is worth noting.
Struggles and Difficulties : I do not have much prior knowledge of the wine industry or of how wine quality is tested. During this analysis I learned about the various components (chemicals) involved in the wine making process. Understanding how these chemicals affect wine quality was a major challenge for me.
Success : While drawing the univariate and bivariate plots, I learned about the effect of some of the chemical properties (e.g. alcohol, pH, density, acidity) on wine quality. Drawing these plots also taught me more about the data and the relationships within it.
The analysis can be carried forward in the future with more recent data, since the present data is from 2009. Also, this data contains variants of only the Portuguese “Vinho Verde” wine. It would be better if samples of other wines were included in the dataset to make it more diverse.